REMARKS

In the Office Action dated May 6, 2004, claims 1-28 are pending and have been
rejected. Accordingly, claims 1-28 are at issue. By virtue of the present Amendment,
claims 1 and 20 have been amended. The support for this amendment can be found on
page 5, lines 1-17 of the present application. With these amendments, no new matter has
been added.



35 U.S.C. § 103

At section 7, claims 1-9, 13-22 and 26-28 are rejected under 35 U.S.C. § 103 as
anticipated by US Patent No. 5,790,664 ("Coley") in view of US Patent No.
[...] ("Saitoh").

The Examiner notes in the Office Action that "While the applicants arguments
[...] what is shown in the art references and what is intended by the current
[...] may be correct, the claims are written in a fashion so broad that they can be
[...] in many different ways". Applicants have revised independent claims 1 and
20 to clarify the invention, stating that "searching the communication network by
[...] mechanism for the installation site address" to more succinctly point out
that the searching is done by a monitoring mechanism over the communications network.
As discussed in the previous Amendment, neither Coley nor Saitoh discusses the searching
[...] installation site address over the network by a monitoring mechanism. Applicant
respectfully requests that the Examiner enter these amendments as they are revised to
comply with the Examiner's comments. Applicants believe that these changes in light of
the Examiner's remarks and the discussion found in the previous Amendment place the
claims in a form for allowance.
Claims 2-9, 13-19, 21, 22, and 26-28 are dependent from claims 1 and 20 and
recite features not recited in claims 1 or 20. For reasons regarding claims 1 and 20 above,
it is respectfully submitted that claims 2-9, 13-19, 21, 22 and 26-28 are also
distinguishable over the cited Coley and Saitoh references.

35 U.S.C. § 112, First paragraph

At section 2 of the office action, claims 10, 11, 12, 23 and 24 are rejected under
35 USC § 112 first paragraph for containing subject matter that is not described in the
specification in such a way to enable one skilled in the art to practice the invention. The
Examiner states that "the specification says only 'a search device, such as a web spider,
can be coded' but goes into no detail about how this code should work." The Examiner
also asserts that "this type of system of searching IP address or MAC addresses close to
the address of known pieces of equipment is not well-known in the art..."

The Applicants respectfully disagree with the Examiner because the description in
the present application is more than sufficient to enable one skilled in the art to practice the
invention without undue experimentation. Claim 10 calls for "...searching a further site
address based on the IP address in order to determine whether the product is also used at
the further site address." Claims 11, 12, 23, and 24 contain similar language.



The searching for a site address is described in the specification on page 5 at lines
8-16 as follows:

A search device 90, similar to a web spider, which is used to search the
world-wide web over the Internet, can be coded to take the information in
the database 60 and search IP addresses close to the IP address, as listed in
the list 64 in the database 60, for additional units that have not been
registered. The addresses close to the IP address are denoted by reference
numeral 22, and the additional units are denoted by reference numeral 12.
In that way, it is possible to determine whether the products are used in
compliance with licenses. Moreover, the search device 90 can also use each
of the MAC addresses that are assigned by the manufacturer to the products,
as listed in the list 66, to search for additional units based on the message
102. If additional units are found, they can also be entered into the database
60.
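By way of illustration only, the enumeration of "IP addresses close to the IP address" might be sketched as follows. The sketch is not taken from the specification. It assumes that "close" means "in the same subnet as a registered address"; the function name `neighbor_addresses`, the prefix length, and the example address (from a documentation-only range) are hypothetical, and no network traffic is generated.

```python
import ipaddress


def neighbor_addresses(known_ip, prefix_len=24):
    """Return the other host addresses in the subnet that contains known_ip.

    Assumption: "close to the IP address" is read as "same subnet".
    """
    network = ipaddress.ip_network(f"{known_ip}/{prefix_len}", strict=False)
    known = ipaddress.ip_address(known_ip)
    # hosts() skips the network and broadcast addresses of the subnet.
    return [str(host) for host in network.hosts() if host != known]


if __name__ == "__main__":
    # 192.0.2.0/24 is reserved for documentation, so nothing real is named.
    for candidate in neighbor_addresses("192.0.2.10", prefix_len=29):
        print(candidate)
```

Each candidate address would then be handed to whatever routine queries a unit for its registration data; that routine is outside the scope of this sketch.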



In the MPEP at 2164, the requirements for making a determination of enablement
are discussed. There are several criteria here that demonstrate that the present invention
is enabled.

At 2164.01(b) of the MPEP, it states:

As long as the specification discloses at least one method for making and
using the claimed invention that bears a reasonable correlation to the entire
scope of the claim, then the enablement requirement of 35 U.S.C. 112 is
satisfied. In re Fisher, 427 F.2d 833, 839, 166 USPQ 18, 24 (CCPA 1970).

The claim calls for "...searching a further site address..." In the specification at page 5
lines 8-16, there is a description of several methods of how to search for site addresses.
One such example (In re Fisher states that the requirement is only for a single method) is
"...the search device 90 can also use each of the MAC addresses that are assigned by the
manufacturer to the products, as listed in the list 66, to search for additional units based
on the message 102." This description is a very explicit explanation of one method of
"...searching a further site address..." as required for the claim. Searching through a
predetermined list, such as "the MAC addresses that are assigned by the manufacturer to
the products" is a simple task that is well known in the art. A freshman Computer Science
student would know how to write a simple loop to run through a fixed list trying to
read memory from each device. This disclosure is clearly enabled.
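A loop of the kind referred to above might look like the following sketch. It is offered only as an illustration under stated assumptions: the `probe` callable stands in for whatever mechanism reads memory from a device, and the names `find_units` and `fake_probe`, as well as the sample MAC addresses (taken from a documentation range), are hypothetical and do not come from the specification.

```python
from typing import Callable, Iterable, List, Optional


def find_units(mac_addresses: Iterable[str],
               probe: Callable[[str], Optional[bytes]]) -> List[str]:
    """Walk a fixed list of MAC addresses and keep those that answer a probe."""
    found = []
    for mac in mac_addresses:
        reply = probe(mac)          # hypothetical "read memory" request
        if reply is not None:       # None is taken to mean "no unit answered"
            found.append(mac)
    return found


def fake_probe(mac: str) -> Optional[bytes]:
    """Stand-in for a real device query; only one address 'responds'."""
    responding = {"00:00:5e:00:53:01": b"unit-data"}
    return responding.get(mac)


if __name__ == "__main__":
    manufacturer_list = ["00:00:5e:00:53:00", "00:00:5e:00:53:01"]
    print(find_units(manufacturer_list, fake_probe))
```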



In another example, the specification calls for a web-spider. An Internet-based
search device, such as web spider, is well-known in the art. Attached to this Amendment
is an article entitled "A Web Crawler in PERL" by Mike Thomas from Linux Journal
(Volume 1997 Issue 40es) dated August 1997 that describes how to write a web spider.
This article further includes source code for the implementation of a web spider. The
implementation of a web spider is well known and therefore this aspect of the claims is
well enabled in the specification.



Furthermore, the standard that is used to determine enablement used in the Office
Action is not correct. The Office Action states that "The issue is not whether one [of]
ordinary skill would be able to program such a system, but that they would be able to
program such a system without undue experimentation." This standard is not correctly
applied here, for the MPEP at 2164.01 states:



The fact that experimentation may be complex does not necessarily make it
undue, if the art typically engages in such experimentation. In re Certain
Limited-Charge Cell Culture Microcarriers, 221 USPQ 1165, 1174 (Int'l
Trade Comm'n 1983), aff'd. sub nom., Massachusetts Institute of
Technology v. A.B. Fortia, 774 F.2d 1104, 227 USPQ 428 (Fed. Cir. 1985).
See also In re Wands, 858 F.2d at 737, 8 USPQ2d at 1404. The test of
enablement is not whether any experimentation is necessary, but whether, if
experimentation is necessary, it is undue. In re Angstadt, 537 F.2d 498, 504,
190 USPQ 214, 219 (CCPA 1976).

And further in 2164.06:

The quantity of experimentation needed to be performed by one skilled in
the art is only one factor involved in determining whether "undue
experimentation" is required to make and use the invention. "[A]n extended
period of experimentation may not be undue if the skilled artisan is given
sufficient direction or guidance." In re Colianni, 561 F.2d 220, 224, 195
USPQ 150, 153 (CCPA 1977). " 'The test is not merely quantitative, since a
considerable amount of experimentation is permissible, if it is merely
routine, or if the specification in question provides a reasonable amount of
guidance with respect to the direction in which the experimentation should
proceed.' " In re Wands, 858 F.2d 731, 737, 8 USPQ2d 1400, 1404 (Fed.
Cir. 1988) (citing In re Angstadt, 537 F.2d 489, 502-04, 190 USPQ 214,
217-19 (CCPA 1976)).



The question is whether the specification provides a reasonable amount of guidance with
respect to the direction in which the experimentation should proceed. The present
application provides at least three different areas of direction for one skilled in the art to
use in implementing the present invention. The specification describes the use of a "web-
crawler", an IP search, and a MAC address search. Routine coding of a search, either for
an IP address or a MAC address, is clearly not "undue experimentation" as explained by
In re Wands.

In fact, the MPEP continues in 2164.08 "Nevertheless, not everything necessary to
practice the invention need be disclosed. In fact, what is well-known is best omitted. In re
Buchner, 929 F.2d 660, 661, 18 USPQ2d 1331, 1332 (Fed. Cir. 1991)". Enumerating
specific search techniques that are well known in the art is best omitted, as has been done
in the present application.



This is particularly important in the instant case because the present invention is
not concerned with the programming of a search engine. Rather, the present invention
uses an existing searching tool to search IP addresses over the Internet. General computer
searching techniques have been well known since the 1950s, and web-spiders have been
well known in the industry for at least 7 years.

Therefore, applicant requests that the Examiner remove the rejections of claims
10-12, 23 and 24 based upon 35 USC 112 first paragraph.

Conclusion

In view of the foregoing Amendment and Remarks, Applicants respectfully
submit that claims 1-28 claim matter that is distinct from the prior art and request that the
objections and rejections be withdrawn. With the submission of this Amendment, this
application is in condition for further examination and early consideration of the claims at
issue and early allowance is hereby requested. The Commissioner is authorized to
charge any additional fees or credit any overpayments associated with this Amendment to
Deposit Account 19-2875 (SAA-49). Applicants further invite the Examiner to contact
the undersigned representative at the telephone number below to discuss any matters
pertaining to the present Application.

Respectfully submitted,

SCHNEIDER ELECTRIC AUTOMATION BUSINESS
1415 South Roselle Road
Palatine, IL 60067
Telephone: 978/975-9789
Facsimile: 847/925-7419

Richard A. Baker
Patent Agent, Reg. No. 48,124
A Web Crawler in Perl

Here's how spiders search the Web collecting information for you.

by Mike Thomas



Web-crawling robots, or spiders, have a certain mystique among Internet users. We all use search
engines like Lycos and Infoseek to find resources on the Internet, and these engines use spiders to gather
the information they present to us. Very few of us, however, actually use a spider program directly.

Spiders are network applications which traverse the Web, accumulating statistics about the content
found. So how does a web spider work? The algorithm is straightforward:

1. Create a queue of URLs to be searched beginning with one or more known URLs.
2. Pull a URL out of the queue and fetch the Hypertext Markup Language (HTML) page which can
   be found at that location.
3. Scan the HTML page looking for new-found hyperlinks. Add the URLs for any hyperlinks found
   to the URL queue.
4. If there are URLs left in the queue, go to step 2.
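The four steps above can be sketched in a few lines of Python. This is a minimal illustration rather than a translation of spider.pl: the `fetch` callable, the in-memory `FAKE_WEB` table, and the `example.test` URLs are stand-ins invented for the sketch so that it runs without touching the network, and the `seen` set (which keeps the crawl from looping) goes slightly beyond the four steps as listed.

```python
import re
from collections import deque

# Matches <A ... HREF="..."> in either case; a simplification of real HTML parsing.
LINK_RE = re.compile(r'<a\s[^>]*href\s*=\s*"([^"]*)"', re.IGNORECASE)


def crawl(start_urls, fetch):
    """Visit every page reachable from start_urls; return URLs in visit order."""
    queue = deque(start_urls)            # step 1: queue of known URLs
    seen = set(start_urls)               # URLs already queued (avoids loops)
    visited = []
    while queue:                         # step 4: repeat while URLs remain
        url = queue.popleft()            # step 2: pull a URL and fetch the page
        html = fetch(url)
        visited.append(url)
        for link in LINK_RE.findall(html):   # step 3: scan for hyperlinks
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# Hypothetical three-page "web" used in place of real HTTP requests.
FAKE_WEB = {
    "http://example.test/a": '<A HREF="http://example.test/b">B</A>',
    "http://example.test/b": '<a href="http://example.test/a">A</a> '
                             '<a href="http://example.test/c">C</a>',
    "http://example.test/c": "no links here",
}


def fake_fetch(url):
    return FAKE_WEB.get(url, "")


if __name__ == "__main__":
    print(crawl(["http://example.test/a"], fake_fetch))
```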



Listing 1 is a program, spider.pl, which implements the above algorithm in Perl. This program should
run on any Linux system with Perl version 4 or higher installed. Note that all code mentioned in this
article assumes Perl is installed in /usr/bin/perl. These scripts are available for download on my web
page at http://www.javanet.com/~thomas/.

To run the spider at the shell prompt use the command:

   spider.pl <starting-URL> <search-phrase>

The spider will commence the search. The starting URL must be fully specified, or it may not parse
correctly. The spider searches the initial page and all its descendant pages for the given search phrase.
The URL of any page with a match is printed. To print a list of URLs from the SSC site containing the
phrase "Linux Journal", type:

   spider.pl http://www.ssc.com/ "Linux Journal"

The Perl variable $DEBUG, defined in the first few lines of spider.pl, is used to control the amount of
output the spider produces. $DEBUG can range from 0 (matching URLs are printed) to 2 (status of the
program and dumps of internal data structures are output).

Interaction with the Internet

The most interesting thing about the spider program is the fact that it is a network program. The
subroutine get_http() encapsulates all the network programming required to implement a spider; it
does the "fetch" alluded to in step 2 of the above algorithm. This subroutine opens a socket to a server
and uses the HTTP protocol to retrieve a page. If the server has a port number appended to it, this port is
used to establish the connection; otherwise, the well-known port 80 is used.

Once a connection to the remote machine has been established, get_http() sends a string such as:
   GET /index.html HTTP/1.0

This string is followed by two newline characters. This is a snippet of the Hypertext Transport Protocol
(HTTP), the protocol on which the Web is based. This request asks the web server to which we are
connected to send the contents of the file /index.html to us. get_http() then reads the socket until an
end of file is encountered. Since HTTP is a connectionless protocol, this is the extent of the
conversation. We submit a request, the web server sends a response and the connection is terminated.

The response from the web server consists of a header, as specified by the HTTP standard, and the
HTML-tagged text making up the page. These two parts of the response are separated by a blank line.
Running the spider at debug level 2 will display the HTTP headers for you as a page is fetched. The
following is a typical response from a web server.

   HTTP/1.0 200 OK
   Date: Tue, 11 Feb 1997 21:54:05 GMT
   Server: Ap[...]
   Content-type: text/html
   Content-length: 79
   Last-modified: Fri, 22 Nov 1996 10:11:48 GMT

   <HTML><TITLE>My Web Page</TITLE>
   <BODY>
   This is my web page.
   </BODY>
   </HTML>

The spider program checks the Content-type field in the HTTP header as it arrives. If the content is of
any MIME type other than text/html or text/plain, the download is aborted. This avoids the time-
consuming download of things like .Z and .tar.gz files, which we don't wish to search. While most sites
use the FTP protocol to transfer this type of file, more and more sites are using HTTP.

There is a hardware dependency in get_http() that you should be aware of if you are running Linux on
a SPARC or Alpha. When building the network addresses for the socket, the Perl pack() routine is used
to encode integer data. The line:

   $sockaddr="S n a4 x8";

is suitable only for 32-bit CPUs. To get around this, see Mike Mull's article "Perl and Sockets" in LJ
Issue 35.

The URL Queue

Once the spider has downloaded the HTML source for a web page, we can scan it for text matching the
search phrase and notify the user if we find a match.

We can also find any hypertext links embedded in the page and use them as a starting point for a further
search. This is exactly what the spider program does; it scans the HTML content for anchor tags of the
form <A HREF="url"> and adds any links it finds to its queue of URLs.

A hyperlink in an HTML page can be in one of several forms. Some of these must be combined with the
URL of the page in which they're embedded to get a complete URL. This is done by the fqURL()
function. It combines the URL of the current page and the URL of a hyperlink found in that page to
produce a complete URL for the hyperlink.

For example, here are some links which might be found in a fictitious web page at
http://www.ddd.com/clients/index.html, together with the resulting URL produced by fqURL().

URL in Anchor Tag                 Resulting URL
http://www.eee.org/index.html     http://www.eee.org/index.html
att.html                          http://www.ddd.com/clients/att.html
/att.html                         http://www.ddd.com/att.html

As these examples show, the spider can handle both a fully-specified URL and a URL with only a
document name. When only a document name is given, it can be either a fully qualified path or a
relative path. In addition, the spider can handle URLs with port numbers embedded, e.g.,
http://www.ddd.com:1234/index.html.
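The same three cases can be reproduced with Python's standard `urllib.parse.urljoin`, shown here as a point of comparison rather than as a description of how fqURL() is written. The wrapper name `resolve` is hypothetical; the URLs are the fictitious ones from the table above.

```python
from urllib.parse import urljoin


def resolve(page_url, anchor):
    """Combine the URL of the current page with a link found in that page."""
    return urljoin(page_url, anchor)


if __name__ == "__main__":
    page = "http://www.ddd.com/clients/index.html"
    for anchor in ("http://www.eee.org/index.html", "att.html", "/att.html"):
        print(f"{anchor:32} {resolve(page, anchor)}")
```

A port number embedded in the page URL is carried through to the result in the same way.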



One function not implemented in fqURL() is the stripping of back-references (../) from a URL. Ideally,
the URL /test/../index.html is translated to /index.html, and we know that both point to the same
document.

Once we have a fully-specified URL for a hyperlink, we can add it to our queue of URLs to be scanned.
One concern that crops up is how to limit our search to a given subset of the Internet. An unrestricted
search would end up downloading a good portion of the world-wide Internet content, not something we
want to do to our compadres with whom we share network bandwidth. The approach spider.pl takes is to
discard any URL that does not have the same host name as the beginning URL; thus, the spider is
limited to a single host. We could also extend the program to specify a set of legal hosts, allowing a
small group of servers to be searched for content.

Another issue that arises when handling the links we've found is how to prevent the spider from going in
circles. Circular hyperlinks are very common on the Web. For example, page A has a link to page B and
page B has a link back to page A. If we point our spider at page A, it finds the link to B and checks it
out. On B it finds a link to A and checks it out. This loop continues indefinitely. The easiest way to
avoid getting trapped in a loop is to keep track of where the spider has been and ensure that it doesn't
return. Step 2 in the algorithm shown at the beginning of this article suggests that we "pull a URL out of
our queue" and visit it. The spider program doesn't remove the URL from the queue. Instead, it marks
that URL as having been scanned. If the spider later finds a hyperlink to this URL, it can ignore it,
knowing it has already visited the page. Our URL queue holds both visited and unvisited URLs.

The set of pages the spider has visited will grow steadily, and the set of pages it has yet to visit can grow
and shrink quickly, depending on the number of hyperlinks found in each page. If a large site is to
be traversed you may need to store the URL queue in a database, rather than in memory as we've done
here. The associative array that holds the URL queue, %URLqueue, could easily be linked to a GDBM
database with the Perl 4 functions dbmopen() and dbmclose() or Perl 5 functions tie() and untie().

Responsible Use

Note that you should not unleash this beast on the Internet at large, not only because of the bandwidth it
consumes, but also because of Internet conventions. The document request the spider sends is a one line
GET request. To strictly follow the HTTP protocol, it should also include User-Agent and From fields,
giving the remote server the opportunity to deny our request and/or collect statistics.
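A request carrying the two extra fields might be assembled as in the following sketch. The function name `build_request`, the agent string, and the mail address are hypothetical. The sketch ends each line with CR LF, which is the line terminator the HTTP/1.0 specification calls for, whereas the one-line request described above is followed by bare newlines.

```python
def build_request(path, user_agent, from_addr):
    """Return an HTTP/1.0 GET request with User-Agent and From header fields."""
    lines = [
        f"GET {path} HTTP/1.0",
        f"User-Agent: {user_agent}",   # identifies the robot to the server
        f"From: {from_addr}",          # contact address of the robot's operator
        "",                            # blank line that ends the header
        "",
    ]
    return "\r\n".join(lines)


if __name__ == "__main__":
    request = build_request("/index.html", "spider-sketch/0.1",
                            "webmaster@example.test")
    print(request.replace("\r\n", "\\r\\n\n"), end="")
```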
This program also ignores the "robots.txt" convention that is used by administrators to deny access to
robots. The file /robots.txt should be checked before any further scanning of a host. This file indicates if
scanning from a robot is welcome and declares any subdirectories that are off-limits. A robots.txt file
that excludes scanning of only 2 directories looks like this:

   User-agent: *
   Disallow: /tmp/
   Disallow: /cgi-bin/

A file that prohibits all scanning on a particular web server looks like this:

   User-agent: *
   Disallow: /

Robots like our spider can place a heavy load on a web server, and we don't wish to use it on servers that
have been declared off-limits to robots by their administrators.

Application of the spider.pl Script

How might we use the spider program, other than as a curiosity? One use for the program would be as a
replacement for one of the web site index and query programs like Harvest
(http://harvest.cs.colorado.edu/Harvest/) or Excite for Web Servers
(http://www.excite.com/navigate/prodinfo.html). These programs are large and complicated. They often
provide the functionality of the Perl spider program, a means of archiving the text retrieved and a CGI
query engine to run against the resulting database. Ongoing maintenance is required, since the query
engine runs against the database rather than against the actual site content; therefore, the database must
be regenerated whenever a change is made to the content of the site.

Some search engines, such as Excite for Web Servers, cannot index the content at a remote site. These
engines build their database from the files which make up the web site, rather than from data retrieved
across a network. If you had two web sites whose content was to appear in a single search application,
these tools would not be appropriate. Furthermore, the Linux version of Excite for Web Servers is still in
the "coming soon" stage.

Listing 2 and Listing 3 show a simple CGI search engine that is implemented using the spider.pl
program. Listing 2 is an HTML form which calls spiderfind.cgi to process its input. Listing 3 is
spiderfind.cgi. It first uses Brigitte Jellinek's library to move the data entered in the form into an
associative array. It then calls the spider.pl program using the Perl system() function and passes the
form data as parameters. Finally, it converts the output from spider.pl into a series of HTML links. The
user's browser will display a list of hyperlinked URLs in which the search text was found. Note that the
name of the host to search is specified by a hidden field in the HTML document. There are better and
more security-conscious ways for two Perl programs to interact than through a Perl system() call, but I
wanted to use an unmodified copy of spider.pl for this demonstration.

This script doesn't provide the complete functionality of the packages mentioned above, and it won't
perform as well. Since we're doing the search against web server documents across the Net, we don't
have the advantage of index files; therefore, the search will be slower and more processor-intensive.
However, this script is easy to install and easier to maintain than those engines.

Another application that could be built using the spider.pl program is a broken link scanner for the Web.
The HTTP response we showed previously began with the line "HTTP/1.0 200 OK", indicating the
request could be fulfilled. If we tried to hit a URL with a non-existent document, we would get the line
"HTTP/1.0 404 Not found" instead. We could use this as an indication that the document does not
exist and print the URL which referenced this page.

The modifications to the spider program needed to accomplish this are minor. Every time a hyperlink's
URL is added to the URL queue, we also record the URL of the document in which we found the
hyperlink. Then, when the spider checks out the hyperlink and receives a "404 Not found" response, it
outputs the URL of the referring page.
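The bookkeeping described above might be sketched as follows. This is an illustration only, not a modification of spider.pl: the two dictionaries stand in for data the spider would collect while crawling, and the names `status_code` and `broken_links` and the `example.test` URLs are hypothetical.

```python
def status_code(status_line):
    """Extract the numeric code from a line such as 'HTTP/1.0 404 Not found'."""
    return int(status_line.split()[1])


def broken_links(referrers, status_lines):
    """Return (referring page, dead URL) pairs for every 404 response.

    referrers    maps each queued URL to the page in which it was found.
    status_lines maps each fetched URL to the first line of its response.
    """
    report = []
    for url, line in status_lines.items():
        if status_code(line) == 404:
            report.append((referrers[url], url))
    return report


if __name__ == "__main__":
    referrers = {
        "http://example.test/ok.html": "http://example.test/index.html",
        "http://example.test/missing.html": "http://example.test/index.html",
    }
    status_lines = {
        "http://example.test/ok.html": "HTTP/1.0 200 OK",
        "http://example.test/missing.html": "HTTP/1.0 404 Not found",
    }
    for page, dead in broken_links(referrers, status_lines):
        print(f"{page} links to missing document {dead}")
```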

Mike Thomas is an Internet application developer working for a
consulting firm in Saskatchewan, Canada. Mike lives in Massachusetts
and uses two Linux systems to telecommute 2000 miles to his job and to
Graduate School at the University of Regina. He can be reached by e-
mail at thomas@javanet.com.
#!/usr/bin/perl
#
# spider.pl    Set tabstops to 3.
#

$|=1;
# 0-no debug, 1-display progress, 2-complete dump
$DEBUG = 0;
# Check hyperlinks to other hosts?
$SPANHOSTS = "off";

if (scalar(@ARGV) < 2) {
   print "Usage: $0 <fully-qualified-URL> <search-phrase>\n";
   exit 1;
}

# Initialize.
%URLqueue = ();
chop($client_host=`hostname`);
$been = 0;
$search_phrase = $ARGV[1];

# Load the queue with the first URL to hit.
$URLqueue{$ARGV[0]} = 0;
$thisURL = &find_new(%URLqueue);

# While there's a URL in our queue which we haven't looked at ...
while ($thisURL ne "") {

   # Progress report.
   $count = 0;
   while (($key, $value) = each(%URLqueue)) {
      $count++;
   }
   print "-----------------------------------------\n" if ($DEBUG>=1);
   printf("Been: %d  To Go: %d\n", $been, $count-$been)
      if ($DEBUG>=1);
   print "Current URL: $thisURL\n" if ($DEBUG>=1);
   &dump_stack() if ($DEBUG>=2);

   # Split the protocol from the URL.
   ($protocol, $rest) = $thisURL =~ m|^([^:/]*):(.*)$|;

   # If the protocol is http, fetch the page and process it.
   if ($protocol eq "http") {

      # Split out the hostname, port and document.
      ($server_host, $port, $document) =
         $rest =~ m|^//([^:/]*):*([0-9]*)/*([^:]*)$|;

      # Get the page of text and remove CR/LF characters and HTML
      # comments from it.
      $page_text = &get_http($client_host, $server_host, $port,
         $document);
      $page_text =~ tr/\r\n//d;
      $page_text =~ s|<!--[^>]*-->||g;

      # Report if our search string is found here.
      if ($page_text =~ m|$search_phrase|i) {
         print "$thisURL\n";
      }

      # Find anchors in the HTML and update our list of URLs.
      (@anchors) = $page_text =~ m|<A[^>]*HREF\s*=\s*"([^
         ">]*)"|gi;
      foreach $anchor (@anchors) {

         $newURL = &fqURL($thisURL, $anchor);
         if ($URLqueue{$newURL} > 0) {
            # Increment the count for URLs we've already
            # checked out.
            $URLqueue{$newURL}++;
         }
         else {
            # Add a zero record for URLs we haven't
            # encountered.
            # Optionally, ignore URLs which point to other
            # hosts.
            ($new_host) =
               $newURL =~ m|^http://([^:/]*):*[0-9]*/*[^:]*$|i;
            if ($SPANHOSTS eq "on" || $new_host eq
               $server_host) {
               $URLqueue{$newURL} = 0;
            }
         }
      }
   }
   else {
      print "Protocol '$protocol' ignored.\n" if ($DEBUG>=1);
   }

   # Record the fact that we've been here, and get a new URL to process.
   $URLqueue{$thisURL}++;
   $been++;
   $thisURL = &find_new(%URLqueue);
}

exit;

#-----
# Build a fully specified URL.
#-----
sub fqURL
{
   local($thisURL, $anchor) = @_;
   local($has_proto, $has_lead_slash, $currprot, $currhost, $newURL);

   # Strip anything following a number sign because its
   # just a reference to a position within a page.
   $anchor =~ s|^(.*)#[^#]*$|$1|;

   # Examine anchor to see what parts of the URL are specified.
   $has_proto = 0;
   $has_lead_slash = 0;
   $has_proto = 1 if ($anchor =~ m|^[^/:]+:|);
   $has_lead_slash = 1 if ($anchor =~ m|^/|);

   if ($has_proto == 1) {
      # If protocol specified, assume anchor is fully qualified.
      $newURL = $anchor;
   }
   elsif ($has_lead_slash == 1) {
      # If document has a leading slash, it just needs protocol and host.
      ($currprot, $currhost) = $thisURL =~ m|^([^:/]*):/+([^:/]*)|;
      $newURL = $currprot . "://" . $currhost . $anchor;
   }
   else {
      # Anchor must be just relative pathname, append it to current URL.
      ($newURL) = $thisURL =~ m|^(.*)/[^/]*$|;
      $newURL .= "/" if (! ($newURL =~ m|/$|));
      $newURL .= $anchor;
   }

   if ($DEBUG>=2) {
      print "Link Found\n  In: $thisURL\n  Anchor: $anchor\n  Result: $newURL\n";
   }
   return $newURL;
}

#-----
# Do a linear search of the url stack to find a URL with a data
# value of 0 (i.e. one we haven't checked out yet).
#-----
sub find_new
{
   local(%URLqueue) = @_;
   local($key, $value);

   while (($key, $value) = each(%URLqueue)) {
      return $key if ($value == 0);
   }
   return "";
}

#-----
# Debugging utility.
#-----
sub dump_stack
{
   local($key, $x);
   local($done, $togo);

   foreach $key (keys(%URLqueue)) {
      if ($URLqueue{$key} == 0) {
         $togo .= "   " . $key . "\n";
      }
      else {
         $done .= "   " . $key
            . " (hit count = " . $URLqueue{$key} . ")\n";
      }
   }
   print "Been There:\n" . $done;
   print "To Go:\n" . $togo;
   print "--- Hit Q to Quit, Enter to Continue ---";
   read(STDIN, $key, 1);
   exit(0) if ($key eq 'Q' || $key eq 'q');
}

#-----
# Get the page indicated by the $server_host and $document parameters.
#-----
sub get_http
{
   local($client_host, $server_host, $port, $document) = @_;
   local($name, $aliases, $type, $len);
   local($this, $thisaddr, $that, $thataddr);
   # $proto is declared here; re-declaring $client_host with local()
   # would discard the value passed in above.
   local($proto, $sockaddr, $a, $b, $c, $d);
   local($page, $header, $header_text, $content);

   # Some constants used to access the TCP network.
   $AF_INET=2;
   $SOCK_STREAM=1;

   # Use default http port if none specified.
   $port = 80 if ($port == 0);

   # Get the protocol number for TCP.
   ($name, $aliases, $proto) = getprotobyname("tcp");

   # Get the IP addresses for the two hosts.
   ($name, $aliases, $type, $len, $thisaddr) = gethostbyname($client_host);
   ($name, $aliases, $type, $len, $thataddr) = gethostbyname($server_host);

   # Check we could resolve the server host name.
   ($a, $b, $c, $d) = unpack('C4', $thataddr);
   if ($a eq "" && $b eq "" && $c eq "" && $d eq "") {
      print "ERROR: Unknown host $server_host.\n";
      return "";
   }
   print "Server: $server_host ($a.$b.$c.$d)\n" if ($DEBUG>=2);

   # Pack the AF_INET magic number, the port, and the (already packed) IP
   # addresses into the same format as the C structure would use. Note
   # this is architecture dependent: this pack format works for 32 bit
   # architectures.
   $sockaddr="S n a4 x8";
   $this=pack($sockaddr, $AF_INET, 0, $thisaddr);
   $that=pack($sockaddr, $AF_INET, $port, $thataddr);

   # Create the socket and connect.
   if (!socket(S, $AF_INET, $SOCK_STREAM, $proto)) {
      print "ERROR: Cannot create socket.\n";
      return "";
   }
   print "Socket OK\n" if ($DEBUG>=2);
   if (!connect(S, $that)) {
      print "ERROR: Cannot connect to server $server_host, port $port.\n";
      return "";
   }
   print "Connect OK\n" if ($DEBUG>=2);

   # Turn buffering in the socket off, and send request to the server.
   select(S); $|=1; select(STDOUT);
   print S "GET /$document HTTP/1.0\n\n";
   # Receive the response. Check to ensure the response is of MIME
   # type text/html or text/plain.
   $page = "";
   $header = 1;
   $header_text = "";
   while (<S>) {

      # Check if we've hit the end of the HTTP header (an empty
      # line). If we have, check for a content-type header line, and
      # ensure it is valid.
      if (m|^[\n\r]*$|) {
         $header = 0;
         ($content) = $header_text =~ m|Content-type: (\S+)|i;
         if ($content ne "text/html" && $content ne "text/plain") {
            print "Content type '$content' ignored.\n"
               if ($DEBUG>=1);
            # Abort the download of this page.
            last;
         }
      }
      # Save to a header string if we're still working on the http
      # header.
      elsif ($header == 1) {
         $header_text .= " " . $_;
      }
      # Otherwise, save to the html page string.
      else {
         $page .= $_;
      }
   }

   print "HTTP header:\n $header_text" if ($DEBUG>=2);
   return $page;
}
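For readers more familiar with Python, the header handling in get_http() can be approximated by the sketch below. It is a loose analogue under stated assumptions, not a line-for-line port: it works on a complete response string rather than reading a socket line by line, and the names `split_response` and `is_searchable` are hypothetical.

```python
import re


def split_response(raw):
    """Split a raw HTTP response into (header text, body) at the first blank line."""
    header, _, body = raw.replace("\r\n", "\n").partition("\n\n")
    return header, body


def is_searchable(header):
    """True if the Content-type header names text/html or text/plain."""
    match = re.search(r"^content-type:\s*(\S+)", header,
                      re.IGNORECASE | re.MULTILINE)
    if match is None:
        return False
    mime = match.group(1).split(";")[0].lower()   # drop any ";charset=..." suffix
    return mime in ("text/html", "text/plain")


if __name__ == "__main__":
    raw = ("HTTP/1.0 200 OK\r\n"
           "Content-type: text/html\r\n"
           "Content-length: 79\r\n"
           "\r\n"
           "<HTML><TITLE>My Web Page</TITLE></HTML>\n")
    header, body = split_response(raw)
    print(is_searchable(header), body.strip())
```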
#!/usr/bin/perl
# spiderfind.cgi
# Note: must set $DEBUG=0 in spider.pl.
$|=1;

# Use Brigitte Jellinek's library to get form
# data into the array %form_data.
require("bjellis.pl");
&GetFormArgs();

$search = $form_data{"search"};
$url = $form_data{"url"};

# Build a command using the data passed from the
# form. Note the quotes around the data from the
# form are vital. They prevent a web user from
# entering a search string like
# "test; cd /; rm -r *"
# and deleting every file the web server user has
# access to.
$cmd = sprintf('./spider.pl "%s" "%s"',
   $form_data{"url"},
   $form_data{"search"});

# Run the command and wrap the results up in HTML
# and print it back to the web server.
$result = `$cmd`;
print "Content-type: text/html\n\n";
print "<HTML><TITLE>Search Results</TITLE>\n";
print "<BODY><H2>Search Results for '$search' ";
print "on '$url'</H2>\n";
# Turn each URL printed by spider.pl into a hyperlink.
$result =~ s|([^\n]*)\n|<A href="$1">$1</A><BR>\n|g;
print $result;
print "</BODY></HTML>";
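One caveat on the quoting used in the script above: wrapping form data in double quotes stops a plain `;` from ending the command, but a search string that itself contains a double quote can still break out of the quoting. A more defensive pattern, sketched here in Python with hypothetical helper names, is either to pass the arguments as a list so that no shell parses them, or to quote each argument with the standard `shlex.quote` before building a shell command line.

```python
import shlex


def build_argv(url, phrase):
    """Argument list suitable for subprocess.run(argv); no shell is involved."""
    return ["./spider.pl", url, phrase]


def shell_line(url, phrase):
    """Single command line with every argument quoted for a POSIX shell."""
    return " ".join(shlex.quote(arg) for arg in build_argv(url, phrase))


if __name__ == "__main__":
    hostile = 'test"; rm -r *'
    print(build_argv("http://www.ssc.com/", hostile))
    print(shell_line("http://www.ssc.com/", hostile))
```

With the list form, the hostile phrase reaches spider.pl as one ordinary argument and is never interpreted as shell syntax.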
<HTML>
<TITLE> Mike's WebCrawler Demo </TITLE>

<BODY>
<H2> Mike's WebCrawler Demo</H2>
<P> This form will invoke Mike's webcrawler to
search for the phrase you enter in the form below.
This is a live search, so if the connection to the
selected host is slow, the response may take some
time. </P>

<FORM ACTION="spiderfind.cgi" METHOD=POST>
Search String: <INPUT TYPE="TEXT"
   NAME="search" SIZE=40><BR>
<INPUT TYPE="HIDDEN" NAME="url"
   VALUE="http://www.ssc.com/">
<INPUT TYPE=SUBMIT>
</FORM>

</BODY>
</HTML>